STATS 32 Session 8: Reproducible research
Kenneth Tay
Oct 17, 2019
Recap of session 7
- Importing data with readr
- Where does your data live?
- Factors
File paths and working directories
- File path: a character string that tells you the location of a file
- Absolute path: starts from the “root” directory
- e.g. /Users/kjytay/Downloads/datafile.csv
- Relative path: starts from the current directory (denoted by .)
- e.g. If I am in the folder /Users/kjytay: ./Downloads/datafile.csv
- e.g. If I am in the folder /Users/kjytay/Downloads: ./datafile.csv or simply datafile.csv
File paths and working directories
- Working directory: where R looks for files that you ask it to load
- Also where R will put any files that you ask it to save
- You can see your current working directory at the top of the console or by typing getwd()
- You can change the working directory with the setwd() function or Session > Set Working Directory > …
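As a minimal sketch of how relative and absolute paths interact with the working directory, the code below switches to a temporary folder and saves a small file there. The temporary folder and the file contents are made up for illustration; only getwd() and setwd() come from the slides.

```r
# Remember the current working directory so it can be restored at the end
old_wd <- getwd()

# Use a temporary folder as the working directory (illustrative choice)
setwd(tempdir())

# Save a small made-up file; a bare file name is resolved relative to
# the working directory
writeLines(c("x,y", "1,2"), "datafile.csv")

# These two relative paths refer to the same file
rel_ok <- file.exists("datafile.csv") && file.exists("./datafile.csv")

# Build the absolute path by prepending the working directory
abs_path <- file.path(getwd(), "datafile.csv")
abs_ok <- file.exists(abs_path)

# Restore the original working directory
setwd(old_wd)

# The absolute path still works, whatever the working directory is
still_ok <- file.exists(abs_path)
```

The relative path only finds the file while the working directory is the folder that contains it, whereas the absolute path keeps working after the working directory is changed back.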
Factors
- A concept unique to R
- Useful for working with categorical variables: variables that have a fixed and known set of possible values
- Why use factor variables instead of character variables?
- Character variables don’t protect you from typos
- Character variables don’t sort in a useful way
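The two points above can be illustrated with a small sketch in base R. The month abbreviations are made-up example data, and "Jam" is a deliberate typo.

```r
# Month abbreviations stored as plain character strings
months_chr <- c("Dec", "Apr", "Jan", "Mar")

# Character vectors sort alphabetically, not in calendar order
sort(months_chr)

# Declare the valid values, in the order they should sort
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
months_fct <- factor(months_chr, levels = month_levels)

# Factors sort by level order: Jan, Mar, Apr, Dec
sort(months_fct)

# A typo ("Jam") is not a valid level, so it becomes NA instead of
# silently passing through as a new value
typo_fct <- factor(c("Dec", "Jam"), levels = month_levels)
is.na(typo_fct)
```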
Functions for factors
fct_recode(): change factor levels
fct_collapse() and fct_lump(): reduce the number of factor levels
fct_infreq(): to sort factor levels by how often they appear
fct_reorder(): to sort factor levels by some other variable
fct_rev(): reverse the order of the factor levels
All these functions are part of the forcats package, which is automatically loaded when you load the tidyverse package.
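A rough sketch of these functions in action, assuming the forcats package is installed. The variables size and price are made-up data for illustration and are not part of the course material.

```r
library(forcats)

# Made-up data: shirt sizes and a made-up price for each observation
size  <- factor(c("S", "M", "L", "M", "S", "M"))
price <- c(10, 20, 30, 22, 12, 18)

# By default, levels are in alphabetical order
levels(size)                       # "L" "M" "S"

# fct_recode(): rename levels (new name = "old name")
recoded <- fct_recode(size, small = "S", medium = "M", large = "L")

# fct_collapse(): merge several levels into one
collapsed <- fct_collapse(size, not_large = c("S", "M"))

# fct_lump(): keep the n most frequent levels, lump the rest together
lumped <- fct_lump(size, n = 1)

# fct_infreq(): order levels from most to least frequent
by_freq <- fct_infreq(size)
levels(by_freq)                    # "M" "S" "L"

# fct_reorder(): order levels by another variable
# (the median of price within each level, by default)
by_price <- fct_reorder(size, price)
levels(by_price)                   # "S" "M" "L"

# fct_rev(): reverse the current level order
levels(fct_rev(size))              # "S" "M" "L"
```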
Reproducible research: what & why
Reproducible research: publishing data analyses together with their data and code so that others may “reproduce” the findings.
Why reproducible research?
- Increase transparency and robustness of analyses
- Preserve integrity of analyses over time
- Reduce incentive for dishonest practices
R scripts
- An R script is a file containing lines of R code that are meant to be run all together
- R scripts are typically working files, not intended for presentation
- R scripts have .R file extensions
- Comments can be inserted to explain the code
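For illustration, a short R script might look like the sketch below. The file name analysis.R and the prices are hypothetical; lines starting with # are comments.

```r
# analysis.R (hypothetical file name)
# Purpose: summarise a small set of made-up nightly prices

# Made-up data, for illustration only
prices <- c(120, 85, 230)

# Compute the average price
mean_price <- mean(prices)

# Print the result to the console
print(mean_price)
```

If these lines were saved as analysis.R, the whole script could be run in one go with source("analysis.R") from the R console, or with Rscript analysis.R from a terminal.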
R markdown
RStudio: R markdown is a document format which allows you to “weave together narrative text and code to produce elegantly formatted output.”
Made possible by the knitr package (Yihui Xie)
R markdown: output (1) to (3) [three slides of example output; images not reproduced]
R markdown: more details
- Text (written in Markdown), interspersed with code chunks, “knit” into a document using the knitr package
- Typically used for presentation
- R markdown files have .Rmd extensions
- R markdown cheatsheet and reference guide available here
Surprise: (Almost) all the class material (including slides) was created with R markdown!
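To make the structure of an .Rmd file concrete, the sketch below assembles a minimal one as text and saves it. The title, the narrative sentence and the use of the built-in cars dataset are made up for illustration; the header between the --- lines holds document settings and is not covered in the slides above.

```r
# Three backticks mark the start and end of a code chunk in an .Rmd file;
# they are built with strrep() here only to keep this example easy to read
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  "title: \"A minimal example\"",   # header with document settings
  "output: html_document",
  "---",
  "",
  "Some narrative text, written in **Markdown**.",
  "",
  paste0(fence, "{r}"),             # start of an R code chunk
  "summary(cars)",                  # R code that runs when the document is knit
  fence                             # end of the chunk
)

# Save the document to a temporary .Rmd file and display its contents
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)
cat(readLines(rmd_file), sep = "\n")
```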
Quick intro to Markdown
Markdown is a lightweight markup language: you write plain text with a few formatting symbols, and it is converted into a web file (i.e. HTML) with basic styling.
Has support for:
- Headers
- Emphasis (italics, bold, strikethrough)
- Lists
- Links
- Images
- Etc…
Markdown reference here.
To see what your Markdown (.md) document looks like in real time, use an online Markdown editor (e.g. dillinger.io)
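As a rough sketch of the syntax for the features listed above, the code below writes a few lines of Markdown to a temporary .md file. The link target and image file name are placeholders, and the ~~strikethrough~~ form is an extension that is supported by the Markdown flavour R Markdown uses but not by every Markdown tool.

```r
md_lines <- c(
  "# A top-level header",
  "## A second-level header",
  "",
  "*italics*, **bold**, ~~strikethrough~~",
  "",
  "- first list item",
  "- second list item",
  "",
  "[link text](https://example.com)",   # placeholder URL
  "",
  "![image description](figure.png)"    # placeholder image file
)

# Save to a temporary .md file; its contents can be pasted into an
# online Markdown editor to see the rendered result
md_file <- tempfile(fileext = ".md")
writeLines(md_lines, md_file)
cat(md_lines, sep = "\n")
```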
Today’s dataset: Airbnb listings
Rmd workflow (basic)
1. Edit .Rmd file in RStudio.
2. Knit the document (either by hitting the “Knit” button or using a keyboard shortcut).
   - When you press “Knit”, the file is automatically saved.
   - Next, RStudio opens a new console, “knits” the document there, then closes that console. No code is run in your original console!
   - RStudio creates a .html file in the same folder as the .Rmd file.
3. Preview output in the preview pane, or by opening the .html file.
4. If you want to make changes, go back to Step 1.
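The knitting step can also be triggered from code. The sketch below assumes the rmarkdown package and pandoc are installed, and uses a made-up one-line document in a temporary folder. Unlike the “Knit” button described above, calling rmarkdown::render() yourself runs the document's code in your current R session by default, not in a separate console.

```r
# Write a tiny .Rmd file to a temporary folder (made-up content)
rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(c("---",
             "title: \"Example\"",
             "output: html_document",
             "---",
             "",
             "Hello from R Markdown."),
           rmd_file)

# Expected output path: same folder, same name, .html extension
html_file <- sub("\\.Rmd$", ".html", rmd_file)

# Knit only if the rmarkdown package and pandoc are both available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)

  # The .html file should now sit next to the .Rmd file
  file.exists(html_file)
}
```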
Common Rmd chunk options
include = FALSE: prevents code and results from appearing in the finished file. R Markdown still runs the code in the chunk, and the results can be used by other chunks.
- Useful for decluttering your Rmd output, showing only essential code.
echo = FALSE: prevents code, but not the results, from appearing in the finished file.
- Useful if you just want to show figures but not the code that generated them.
eval = FALSE: code appears in the output but is not run.
- Useful for presenting code for demonstration purposes.
message = FALSE: prevents messages that are generated by code from appearing in the finished file.
- Useful for suppressing messages when loading packages.
warning = FALSE: prevents warnings that are generated by code from appearing in the finished file.
- Useful for suppressing warnings when loading packages, plotting data or fitting models.
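A brief sketch of where these options go, assuming the knitr package is installed. The chunk labels in the comments are made up, and setting defaults for all chunks with knitr::opts_chunk$set() goes slightly beyond what the slides cover.

```r
# Per-chunk options go inside the braces of a chunk header in the .Rmd
# file (after the opening three backticks), for example:
#   {r load-packages, message = FALSE, warning = FALSE}
#   {r price-plot, echo = FALSE}
#   {r demo-only, eval = FALSE}
# (labels such as "load-packages" are made up for illustration)

# Defaults for every chunk can be set once, typically in a first chunk
# that itself uses include = FALSE
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)

  # Check the current default for echo
  knitr::opts_chunk$get("echo")
}
```

Options written in an individual chunk header take precedence over these defaults, so a single chunk can still show its code with echo = TRUE.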